Statistical Acquisition of Terminology Dictionary
Abstract
Terminology consists of the specialized words and compound words used in a particular domain, such as computer science. Because such terms are very common in scientific articles, the ability to identify them automatically could greatly assist domain-related natural language processing applications. Unfortunately, collecting terminology information is difficult and requires tedious, time-consuming manual work. This paper develops a semi-automatic approach to extracting technical words and phrases from on-line corpora, which can significantly reduce the manual effort needed to build a terminology dictionary. First, domain-specific words that have no entry in the universal dictionary are identified. Second, terminology words are extracted from these new words as well as from the universal dictionary. Third, compound words are extracted from combinations of terminology words and other words. The final computer terminology dictionary contains 1,034 words and 3,471 compound words. Experiments show that 89.5 percent of all occurrences of computer terminology can be identified with this terminology dictionary.
Keywords: Chi-square Test, Automatic Indexing, Mutual Information
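The abstract lists mutual information among its keywords but does not say how the measure is applied. As a rough illustration only, the sketch below assumes that step one is a plain dictionary lookup and that compound candidates are adjacent word pairs scored by pointwise mutual information. The function names, the thresholds `min_count` and `min_pmi`, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def new_words(tokens, universal_dictionary):
    """Words in the corpus that have no entry in the general dictionary."""
    return {t for t in tokens if t not in universal_dictionary}


def bigram_pmi(tokens):
    """Pointwise mutual information, log2(P(x, y) / (P(x) * P(y))),
    for every adjacent word pair in the token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def compound_candidates(tokens, term_words, min_count=2, min_pmi=1.0):
    """Adjacent pairs that contain at least one terminology word,
    occur at least min_count times, and score at least min_pmi.
    Both thresholds are illustrative assumptions."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    pmi = bigram_pmi(tokens)
    return sorted(
        pair
        for pair, count in bigrams.items()
        if count >= min_count
        and pmi[pair] >= min_pmi
        and (pair[0] in term_words or pair[1] in term_words)
    )


# Toy corpus and toy "universal dictionary" (both invented for this sketch).
corpus = ("the cache memory is fast and the cache memory is small "
          "but the disk is slow").split()
general = {"the", "is", "fast", "and", "small", "but", "memory", "disk", "slow"}

unknown = new_words(corpus, general)
candidates = compound_candidates(corpus, unknown, min_count=2, min_pmi=3.0)
```

In a semi-automatic workflow like the one the abstract describes, candidates produced this way would still be reviewed by hand before entering the dictionary.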
Similar resources
Inducing Terminology for Lexical Acquisition
Few attention has been paid to terminology extraction for what concerns the possibilities it offers to corpus linguistics and lexical acquisition. The problem of detecting terms in textual corpora has been approached in a complex framework. Terminology is seen as the acquisition of domain specific knowledge (i.e. semantic features, selectional restrictions) for complex terms and /or unknown wor...
Semi-Automatic Acquisition of Domain-Specific Translation Lexicons
We investigate the utility of an algorithm for translation lexicon acquisition (SABLE), used previously on a very large corpus to acquire general translation lexicons, when that algorithm is applied to a much smaller corpus to produce candidates for domain-specific translation lexicons. 1 Introduction Reliable translation lexicons are useful in many applications, such as cross-langua...
Word Knowledge Acquisition for Computational Lexicon Construction
The growing of multilingual information processing technology has created the need of linguistic resources, especially lexical database. Many attempts were put to alter the traditional dictionary to computational dictionary, or widely named as computational lexicon. TCL’s Computational Lexicon (TCLLEX) is a recent development of a large-scale Thai Lexicon, which aims to serve as a fundamental l...
Post-MT Term Swapper: Supplementing a Statistical Machine Translation System with a User Dictionary
A statistical machine translation (SMT) system requires homogeneous training data in order to get domain-sensitive (or context-sensitive) terminology translations. If the data consists of various domains, it is difficult for an SMT system to learn context-sensitive terminology mappings probabilistically. Yet, terminology translation accuracy is an important issue for MT users. This paper explor...
Creating a medical dictionary using word alignment: The influence of sources and resources
BACKGROUND Automatic word alignment of parallel texts with the same content in different languages is among other things used to generate dictionaries for new translations. The quality of the generated word alignment depends on the quality of the input resources. In this paper we report on automatic word alignment of the English and Swedish versions of the medical terminology systems ICD-10, IC...
Publication date: 1997